Latent User Prediction

Authors: Shuangfei Li (University of Lausanne), Zhiqi Feng

Published: May 8, 2023

Abstract

The following machine learning project focuses on…

1 Introduction

1.1 Context

With the rapid growth of the internet, online activities such as online shopping have become increasingly popular, and e-commerce is now an integral part of our lives. Selling online and providing good customer service both involve many unknown factors, so understanding consumer preferences and behavior on an e-commerce platform is key for companies that want to grow their business.

Our project therefore aims to learn customer behavior from data collected by an online C2C fashion store launched in Europe around 2009, and to predict which users are latent (inactive) so that the platform can improve user activity.

1.2 Objective

The main purpose of our project is to predict whether a user is latent on this platform, so that the e-commerce platform can then take measures to improve user activity based on our results.

2 Planned Analysis

– Supervised learning

o Models: KNN, Decision Trees, Random Forest, and Neural Network

o Splitting strategy: cross-validation, with an 80% training set and a 20% test set drawn by stratified sampling (if the dataset is too large, we will use a 70% training set and a 30% test set instead)

o Metrics: accuracy, sensitivity (recall), confusion matrix, and AUC-ROC

– Unsupervised learning: K-means clustering
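The stratified 80/20 split in the plan above could be sketched as follows. This is a minimal base R illustration under stated assumptions, not the split used later in the project: the data frame `toy`, its columns, and the helper `stratified_split` are hypothetical.

```r
# Minimal sketch: stratified 80/20 train/test split in base R.
# `toy` and its columns are hypothetical stand-ins for the cleaned user data.
set.seed(42)
toy <- data.frame(
  productsBought = rpois(1000, lambda = 0.2),
  latent = rbinom(1000, size = 1, prob = 0.9)
)

stratified_split <- function(data, target, train_frac = 0.8) {
  # Sample row indices separately inside each class so that both sets
  # keep roughly the same class proportions as the full data.
  idx_by_class <- split(seq_len(nrow(data)), data[[target]])
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(train_frac * length(idx)))]
  }), use.names = FALSE)
  list(train = data[train_idx, , drop = FALSE],
       test  = data[-train_idx, , drop = FALSE])
}

parts <- stratified_split(toy, "latent")
# Class shares should be nearly identical in both parts
c(train = mean(parts$train$latent), test = mean(parts$test$latent))
```

Sampling inside each class matters here because the target is very unbalanced; a plain random split could leave the test set with too few non-latent users.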

3 Data

3.1 Source

We found a dataset on user behavior in a “global C2C online fashion store” on Kaggle. The dataset contains records of over 9 million registered users from this successful online C2C fashion store, which was launched in Europe around 2009 and subsequently expanded globally. The dataset is available at “https://www.kaggle.com/datasets/thedevastator/global-c2c-fashion-store-user-behaviour-analysis”.

userdataset_orig <- read.csv(here::here("data/users.dataset.public.csv"))

3.2 Description

  • identifierHash: Hash of the user’s id

  • type: The type of entity

  • country: User’s country (written in French)

  • language: The user’s preferred language.

  • socialNbFollowers: Number of users who subscribed to this user’s activity. New accounts are automatically followed by the store’s official accounts.

  • socialNbFollows: Number of user accounts this user follows. New accounts are automatically assigned to follow the official partners.

  • socialProductsLiked: Number of products this user liked.

  • productsListed: Number of currently unsold products that this user has uploaded.

  • productsSold: Number of products this user has sold.

  • productsPassRate: % of products meeting the product description. (Sold products are reviewed by the store’s team before being shipped to the buyer.)

  • productsWished: Number of products this user added to his/her wishlist.

  • productsBought: Number of products this user bought.

  • gender: user’s gender

  • civilityGenderId: civility title as integer, “1” means “mr”, “2” means “mrs” and “3” means “miss”.

  • civilityTitle: Civility title

  • hasAnyApp: User has ever used any of the store’s official apps.

  • hasAndroidApp: User has ever used the official Android app.

  • hasIosApp: User has ever used the official iOS app.

  • hasProfilePicture: User has a custom profile picture.

  • daysSinceLastLogin: Number of days since the last login.

  • seniority: Number of days since the user registered.

  • seniorityAsMonths: seniority expressed in months

  • seniorityAsYears: seniority expressed in years

  • countryCode: user’s country

We use “daysSinceLastLogin” as the indicator of latency: we assume that users whose last login was more than 180 days ago are latent users.

3.3 Cleaning

3.3.1 Data selection

Because the dataset is large, we randomly selected 30,000 records and created a new dataset for the prediction.

userdataset <- sample_n(userdataset_orig, 30000, replace = FALSE)
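A random subsample changes on every run unless the random seed is fixed. The sketch below illustrates this with base R; the data frame `full` and the seed value are hypothetical, and the report itself does not state whether a seed was set.

```r
# Sketch: fixing the random seed makes a random subsample reproducible.
# `full` is a hypothetical stand-in for the full user table.
full <- data.frame(index = 1:100000)

set.seed(2023)   # arbitrary seed, chosen only for illustration
first_draw <- full[sample(nrow(full), 30000), , drop = FALSE]

set.seed(2023)   # same seed, so the same rows are drawn again
second_draw <- full[sample(nrow(full), 30000), , drop = FALSE]

identical(first_draw$index, second_draw$index)
```

The same idea applies to `sample_n()`: calling `set.seed()` immediately before it fixes which rows are selected.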

We then removed the columns that are not useful and the columns that duplicate information.

By inspecting the dataset, we found that the columns “index”, “identifierHash” and “type” carry no information about user behavior. We removed “identifierHash” and “type” here; “index” is kept for now because the transformations below use it as a join key, and it is removed in Section 3.3.3.

In addition, the columns “country” and “countryCode” have the same meaning, so we kept only “country”. Likewise, “seniorityAsMonths” and “seniorityAsYears” restate “seniority” in other units, so they are dropped as well.

Then, we noticed that “hasAnyApp” summarizes “hasAndroidApp” and “hasIosApp”. Since we are not studying the difference between Android and iOS users, these two columns are not useful to us, so we removed them.

Finally, the columns “civilityTitle” and “civilityGenderId” both represent the user’s title. “civilityTitle” is a character column whereas “civilityGenderId” is numeric, so we preferred to keep the numeric “civilityGenderId”.

userdataset <- userdataset %>% select(-2,-3, -16, -18, -19, -23, -24, -25)
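Positional indices such as -2 or -16 depend on the exact column order of the file. As a hedged alternative, columns can be dropped by name; the sketch below uses base R on a hypothetical three-row excerpt (`toy`) that contains only a few of the columns.

```r
# Sketch: dropping columns by name instead of by position.
# `toy` is a hypothetical three-row excerpt with a few of the columns.
toy <- data.frame(
  index = 1:3,
  identifierHash = c("a1", "b2", "c3"),
  type = "user",
  country = c("France", "Italie", "Espagne"),
  countryCode = c("fr", "it", "es"),
  stringsAsFactors = FALSE
)

drop_cols <- c("identifierHash", "type", "countryCode")
toy_clean <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
names(toy_clean)
```

Selecting by name keeps working if the column order changes, and the code documents which variables were removed.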

3.3.2 Data transform

3.3.2.1 language

In the dataset, “language” is a character variable (for example “en”), which is difficult to use when fitting models. We therefore transformed it into numeric data.

First, we created a new data frame with five new columns derived from this column, named “langEn”, “langFr”, “langDe”, “langEs” and “langIt”. Each new column records whether the user chose that language as their preferred language in the raw data: TRUE for yes and FALSE for no. We then encoded FALSE as “0” and TRUE as “1”.

language <- userdataset %>%
  select(index, language) %>%
  mutate(langEn = ifelse(language == "en", TRUE, FALSE),
         langFr = ifelse(language == "fr", TRUE, FALSE),
         langDe = ifelse(language == "de", TRUE, FALSE),
         langEs = ifelse(language == "es", TRUE, FALSE),
         langIt = ifelse(language == "it", TRUE, FALSE)) 

userdataset1 <- left_join(userdataset, language, by = "index")

userdataset1$langEn <- ifelse(userdataset1$langEn == TRUE,1,0)
userdataset1$langDe <- ifelse(userdataset1$langDe == TRUE,1,0)
userdataset1$langFr <- ifelse(userdataset1$langFr == TRUE,1,0)
userdataset1$langEs <- ifelse(userdataset1$langEs == TRUE,1,0)
userdataset1$langIt <- ifelse(userdataset1$langIt == TRUE,1,0)
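The same 0/1 encoding can also be produced directly, without an intermediate data frame and join. The following is a minimal sketch on a hypothetical excerpt (`toy`); the column names follow the report, and the lookup vector `codes` is introduced only for illustration.

```r
# Sketch: 0/1 indicator columns for a character variable, without a join.
# `toy` is a hypothetical excerpt; the column names follow the report.
toy <- data.frame(language = c("en", "fr", "de", "en", "it"),
                  stringsAsFactors = FALSE)

codes <- c(langEn = "en", langFr = "fr", langDe = "de",
           langEs = "es", langIt = "it")

for (col in names(codes)) {
  # A logical comparison converted to integer gives 1 for a match, 0 otherwise.
  toy[[col]] <- as.integer(toy$language == codes[[col]])
}
toy
```

Because no join is involved, this variant does not create the duplicated “language.x” and “language.y” columns.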

3.3.2.2 gender

We observed that “gender” is also a character variable, with values such as “F”, so we processed it the same way as “language”.

We created two new columns from this column in a new data frame, named “genderFamale” (spelled this way in the code) and “genderMale”. We filled these two columns with TRUE or FALSE according to the “gender” column, and then encoded FALSE as “0” and TRUE as “1”.

gender <- userdataset1 %>%
  select(index, gender) %>%
  mutate(genderFamale = ifelse(gender == "F", TRUE, FALSE),
         genderMale = ifelse(gender == "M", TRUE, FALSE))

userdataset2 <- merge(userdataset1, gender, by = "index" )


userdataset2$genderFamale <- ifelse(userdataset2$genderFamale == TRUE,1,0)
userdataset2$genderMale <- ifelse(userdataset2$genderMale == TRUE,1,0)

3.3.2.3 hasAnyApp

In the dataset, “hasAnyApp” is a logical value stored as text. We transformed it into a numeric value, with “0” for “False” and “1” for “True”.

userdataset2$hasAnyApp <- ifelse(userdataset2$hasAnyApp == "True",1,0)

3.3.2.4 hasProfilePicture

Similarly, we used “0” for “False” and “1” for “True” to turn “hasProfilePicture” into a numeric value.

userdataset2$hasProfilePicture <- ifelse(userdataset2$hasProfilePicture == "True",1,0)

3.3.2.5 latent

“daysSinceLastLogin” is the basis on which we determine whether a user is latent. We assume that users who have not logged in for more than 180 days are latent users. We then converted this indicator to a numeric value, with 0 for FALSE and 1 for TRUE.

userdataset2$latent <- ifelse(userdataset2$daysSinceLastLogin>180, TRUE, FALSE)
userdataset2$latent <- ifelse(userdataset2$latent == TRUE,1,0)
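To make the threshold concrete, the short sketch below applies the same rule to a few hypothetical values. It shows that a user at exactly 180 days is not yet counted as latent, because the comparison is strict.

```r
# Sketch: the 180-day rule applied in one step to hypothetical values.
days <- c(11, 180, 181, 694)
latent <- as.integer(days > 180)   # strict ">": exactly 180 days is not latent
latent                             # 0 0 1 1
```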

3.3.3 Remove unused columns

After combining the data, we obtained the dataset “userdataset2”. It still contains columns we no longer need (index, language.x, gender.x, language.y and gender.y), so we removed them.

Userdataset <- userdataset2 %>% select(-1, -3, -12, -18, -24)

3.4 Missing values

There are no missing values in this dataset.

check_na <- is.na(Userdataset)
sum(check_na)
[1] 0
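A count of zero missing values does not rule out implausible values. The summary in Section 4.1.1 reports a maximum of 737028 for daysSinceLastLogin, far above the maximum seniority of 3205. The sketch below, on a hypothetical excerpt (`toy`), counts missing values per column and flags rows where the last login would predate registration.

```r
# Sketch: per-column missing counts plus a simple plausibility check.
# `toy` is a hypothetical excerpt; the last row mimics an implausibly
# large daysSinceLastLogin value.
toy <- data.frame(
  daysSinceLastLogin = c(689, 42, 737028),
  seniority          = c(3205, 3205, 2852)
)

colSums(is.na(toy))   # missing values per column

# A user cannot have last logged in before registering, so
# daysSinceLastLogin should never exceed seniority.
suspicious <- toy$daysSinceLastLogin > toy$seniority
toy[suspicious, ]
```

How to treat such rows (drop them, cap them, or keep them) is a modeling decision that this sketch does not make.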

4 Exploratory data analysis

library(summarytools)

Attaching package: 'summarytools'
The following object is masked from 'package:tibble':

    view
library(DataExplorer)

4.1 Data inspection

4.1.1 Data check

In this part, we look at the structure and type of each variable and carry out a simple graphical analysis. The results show that many variables have very unbalanced distributions, so in the following sections we use different charts for different types of data.

str(Userdataset)
'data.frame':   30000 obs. of  22 variables:
 $ country            : chr  "Suède" "France" "Royaume-Uni" "Italie" ...
 $ socialNbFollowers  : int  3 3 3 3 3 3 3 3 3 3 ...
 $ socialNbFollows    : int  8 8 8 8 8 8 8 8 8 8 ...
 $ socialProductsLiked: int  0 0 4 0 0 370 0 27 0 0 ...
 $ productsListed     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ productsSold       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ productsPassRate   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ productsWished     : int  0 0 0 0 0 1 0 0 9 0 ...
 $ productsBought     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ civilityGenderId   : int  1 1 2 2 2 2 2 1 2 2 ...
 $ hasAnyApp          : num  1 1 0 1 0 0 0 1 1 1 ...
 $ hasProfilePicture  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ daysSinceLastLogin : int  689 709 591 709 558 42 709 669 463 676 ...
 $ seniority          : int  3205 3205 3205 3205 3205 3205 3205 3205 3205 3205 ...
 $ langEn             : num  1 1 1 0 1 1 1 0 1 1 ...
 $ langFr             : num  0 0 0 1 0 0 0 0 0 0 ...
 $ langDe             : num  0 0 0 0 0 0 0 1 0 0 ...
 $ langEs             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ langIt             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ genderFamale       : num  0 0 1 1 1 1 1 0 1 1 ...
 $ genderMale         : num  1 1 0 0 0 0 0 1 0 0 ...
 $ latent             : num  1 1 1 1 1 0 1 1 1 1 ...
summary(Userdataset)
   country          socialNbFollowers socialNbFollows     socialProductsLiked
 Length:30000       Min.   :  3.000   Min.   :    0.000   Min.   :    0.00   
 Class :character   1st Qu.:  3.000   1st Qu.:    8.000   1st Qu.:    0.00   
 Mode  :character   Median :  3.000   Median :    8.000   Median :    0.00   
                    Mean   :  3.454   Mean   :    9.013   Mean   :    5.33   
                    3rd Qu.:  3.000   3rd Qu.:    8.000   3rd Qu.:    0.00   
                    Max.   :744.000   Max.   :13764.000   Max.   :51671.00   
 productsListed      productsSold      productsPassRate  productsWished    
 Min.   :  0.0000   Min.   :  0.0000   Min.   :  0.000   Min.   :   0.000  
 1st Qu.:  0.0000   1st Qu.:  0.0000   1st Qu.:  0.000   1st Qu.:   0.000  
 Median :  0.0000   Median :  0.0000   Median :  0.000   Median :   0.000  
 Mean   :  0.1045   Mean   :  0.1242   Mean   :  0.799   Mean   :   1.549  
 3rd Qu.:  0.0000   3rd Qu.:  0.0000   3rd Qu.:  0.000   3rd Qu.:   0.000  
 Max.   :217.0000   Max.   :170.0000   Max.   :100.000   Max.   :1916.000  
 productsBought     civilityGenderId   hasAnyApp     hasProfilePicture
 Min.   :  0.0000   Min.   :1.000    Min.   :0.000   Min.   :0.0000   
 1st Qu.:  0.0000   1st Qu.:2.000    1st Qu.:0.000   1st Qu.:1.0000   
 Median :  0.0000   Median :2.000    Median :0.000   Median :1.0000   
 Mean   :  0.1821   Mean   :1.776    Mean   :0.268   Mean   :0.9807   
 3rd Qu.:  0.0000   3rd Qu.:2.000    3rd Qu.:1.000   3rd Qu.:1.0000   
 Max.   :405.0000   Max.   :3.000    Max.   :1.000   Max.   :1.0000   
 daysSinceLastLogin   seniority        langEn           langFr      
 Min.   :    11.0   Min.   :2852   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:   569.0   1st Qu.:2857   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :   694.0   Median :3196   Median :1.0000   Median :0.0000  
 Mean   :   678.5   Mean   :3063   Mean   :0.5224   Mean   :0.2633  
 3rd Qu.:   702.0   3rd Qu.:3201   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :737028.0   Max.   :3205   Max.   :1.0000   Max.   :1.0000  
     langDe            langEs            langIt         genderFamale   
 Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:1.0000  
 Median :0.00000   Median :0.00000   Median :0.00000   Median :1.0000  
 Mean   :0.07243   Mean   :0.06077   Mean   :0.08103   Mean   :0.7709  
 3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000  
 Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
   genderMale         latent      
 Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:1.0000  
 Median :0.0000   Median :1.0000  
 Mean   :0.2291   Mean   :0.9016  
 3rd Qu.:0.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000  
print(dfSummary(Userdataset,style="grid",
                plain.ascii = FALSE, 
                tmp.img.dir = "/tmp",
                graph.magnif = 0.8),
      method = "render")

Data Frame Summary

Dimensions: 30000 x 22
Duplicates: 13275
No  Variable              Stats / Values                         Freqs (% of Valid)    Valid           Missing
1   country               1. France                              7566 (25.2%)          30000 (100.0%)  0 (0.0%)
    [character]           2. Etats-Unis                          6289 (21.0%)
                          3. Royaume-Uni                         3450 (11.5%)
                          4. Italie                              2499 (8.3%)
                          5. Allemagne                           1977 (6.6%)
                          6. Espagne                             1734 (5.8%)
                          7. Australie                           821 (2.7%)
                          8. Suède                               558 (1.9%)
                          9. Danemark                            543 (1.8%)
                          10. Pays-Bas                           477 (1.6%)
                          [ 145 others ]                         4086 (13.6%)
2   socialNbFollowers     Mean (sd) : 3.5 (5.5)                  56 distinct values    30000 (100.0%)  0 (0.0%)
    [integer]             min ≤ med ≤ max: 3 ≤ 3 ≤ 744
                          IQR (CV) : 0 (1.6)
3   socialNbFollows       Mean (sd) : 9 (95.1)                   53 distinct values    30000 (100.0%)  0 (0.0%)
    [integer]             min ≤ med ≤ max: 0 ≤ 8 ≤ 13764
                          IQR (CV) : 0 (10.6)
4   socialProductsLiked   Mean (sd) : 5.3 (302.5)                244 distinct values   30000 (100.0%)  0 (0.0%)
    [integer]             min ≤ med ≤ max: 0 ≤ 0 ≤ 51671
                          IQR (CV) : 0 (56.8)
5   productsListed        Mean (sd) : 0.1 (2.5)                  37 distinct values    30000 (100.0%)  0 (0.0%)
    [integer]             min ≤ med ≤ max: 0 ≤ 0 ≤ 217
                          IQR (CV) : 0 (23.8)
6   productsSold          Mean (sd) : 0.1 (2.3)                  47 distinct values    30000 (100.0%)  0 (0.0%)
    [integer]             min ≤ med ≤ max: 0 ≤ 0 ≤ 170
                          IQR (CV) : 0 (18.2)
7   productsPassRate      Mean (sd) : 0.8 (8.4)                  46 distinct values    30000 (100.0%)  0 (0.0%)
    [numeric]             min ≤ med ≤ max: 0 ≤ 0 ≤ 100
                          IQR (CV) : 0 (10.5)
8   productsWished        Mean (sd) : 1.5 (24.9)                 157 distinct values   30000 (100.0%)  0 (0.0%)
    [integer]             min ≤ med ≤ max: 0 ≤ 0 ≤ 1916
                          IQR (CV) : 0 (16.1)
9   productsBought        Mean (sd) : 0.2 (3.3)                  43 distinct values    30000 (100.0%)  0 (0.0%)
    [integer]             min ≤ med ≤ max: 0 ≤ 0 ≤ 405
                          IQR (CV) : 0 (18.4)
10  civilityGenderId      Mean (sd) : 1.8 (0.4)                  1: 6873 (22.9%)       30000 (100.0%)  0 (0.0%)
    [integer]             min ≤ med ≤ max: 1 ≤ 2 ≤ 3             2: 22986 (76.6%)
                          IQR (CV) : 0 (0.2)                     3: 141 (0.5%)
11  hasAnyApp             Min : 0                                0: 21961 (73.2%)      30000 (100.0%)  0 (0.0%)
    [numeric]             Mean : 0.3                             1: 8039 (26.8%)
                          Max : 1
12  hasProfilePicture     Min : 0                                0: 580 (1.9%)         30000 (100.0%)  0 (0.0%)
    [numeric]             Mean : 1                               1: 29420 (98.1%)
                          Max : 1
13  daysSinceLastLogin    Mean (sd) : 678.5 (8505.9)             700 distinct values   30000 (100.0%)  0 (0.0%)
    [integer]             min ≤ med ≤ max: 11 ≤ 694 ≤ 737028
                          IQR (CV) : 133 (12.5)
14  seniority             Mean (sd) : 3062.9 (168.5)             19 distinct values    30000 (100.0%)  0 (0.0%)
    [integer]             min ≤ med ≤ max: 2852 ≤ 3196 ≤ 3205
                          IQR (CV) : 344 (0.1)
15  langEn                Min : 0                                0: 14327 (47.8%)      30000 (100.0%)  0 (0.0%)
    [numeric]             Mean : 0.5                             1: 15673 (52.2%)
                          Max : 1
16  langFr                Min : 0                                0: 22100 (73.7%)      30000 (100.0%)  0 (0.0%)
    [numeric]             Mean : 0.3                             1: 7900 (26.3%)
                          Max : 1
17  langDe                Min : 0                                0: 27827 (92.8%)      30000 (100.0%)  0 (0.0%)
    [numeric]             Mean : 0.1                             1: 2173 (7.2%)
                          Max : 1
18  langEs                Min : 0                                0: 28177 (93.9%)      30000 (100.0%)  0 (0.0%)
    [numeric]             Mean : 0.1                             1: 1823 (6.1%)
                          Max : 1
19  langIt                Min : 0                                0: 27569 (91.9%)      30000 (100.0%)  0 (0.0%)
    [numeric]             Mean : 0.1                             1: 2431 (8.1%)
                          Max : 1
20  genderFamale          Min : 0                                0: 6873 (22.9%)       30000 (100.0%)  0 (0.0%)
    [numeric]             Mean : 0.8                             1: 23127 (77.1%)
                          Max : 1
21  genderMale            Min : 0                                0: 23127 (77.1%)      30000 (100.0%)  0 (0.0%)
    [numeric]             Mean : 0.2                             1: 6873 (22.9%)
                          Max : 1
22  latent                Min : 0                                0: 2952 (9.8%)        30000 (100.0%)  0 (0.0%)
    [numeric]             Mean : 0.9                             1: 27048 (90.2%)
                          Max : 1

Generated by summarytools 1.0.1 (R version 4.2.1)
2023-05-08

4.1.2 Data Attribute Modification

We divide the columns into two categories, numeric and factor. We convert the columns “country”, “civilityGenderId”, “hasAnyApp”, “hasProfilePicture”, “langEn”, “langFr”, “langDe”, “langEs”, “langIt”, “genderFamale” and “genderMale” to factors. The remaining integer columns are converted to numeric.

library(purrr)
library(dplyr)
library(DT)

Userdataset <- as.data.frame(Userdataset)

Userdataset_factor <- Userdataset %>% select(1, 10, 11, 12, 15, 16, 17, 18, 19, 20, 21) %>% colnames()
for(i in Userdataset_factor){
    Userdataset[,i] <- as.factor(Userdataset[,i])
} 

Userdataset_numeric <- Userdataset %>% select(2, 3, 4, 5, 6, 8, 9, 14) %>% colnames()
for(i in Userdataset_numeric){
    Userdataset[,i] <- as.numeric(Userdataset[,i])
} 

        
Userdataset %>% datatable(rownames = FALSE,
                     options = list(scrollX = TRUE,
                                    pageLength = 5))
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html

4.2 Data description by using descriptive statistics and graphs

4.2.1 Graphs for “latent”

The goal of our project is to build a model that predicts whether users are latent on this platform, so we first compare the number of latent and non-latent users.

# `latent` is stored as numeric, so convert it to a factor to get one fill color per class
latent <- Userdataset %>% ggplot(mapping = aes(x = factor(latent), fill = factor(latent))) + geom_bar() + labs(x = "latent", fill = "latent")
latent

As the figure shows, the classes are extremely unbalanced: about 90% of the sampled users are latent. This imbalance may partly result from our random selection of data, so we will try grouped (stratified) sampling to improve it.
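The grouped sampling mentioned above could be sketched as follows, here with country as the grouping variable. This is a base R illustration under stated assumptions: the data frame `toy`, the 10% fraction, and the helper `sample_by_group` are hypothetical.

```r
# Sketch: proportional sampling within groups (base R).
# `toy` and the 10% sampling fraction are hypothetical.
set.seed(1)
toy <- data.frame(
  country = rep(c("France", "Italie", "Espagne"), times = c(600, 300, 100)),
  latent  = rbinom(1000, size = 1, prob = 0.9)
)

sample_by_group <- function(data, group, frac) {
  rows_by_group <- split(seq_len(nrow(data)), data[[group]])
  keep <- unlist(lapply(rows_by_group, function(rows) {
    n_keep <- max(1, round(frac * length(rows)))   # keep at least one row per group
    rows[sample.int(length(rows), n_keep)]
  }), use.names = FALSE)
  data[keep, , drop = FALSE]
}

sub <- sample_by_group(toy, "country", frac = 0.1)
table(sub$country)
```

Sampling within countries preserves the country mix of the original data. It does not by itself change the share of latent users, so the class imbalance would still need to be handled separately.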

4.2.2 Graphs for numeric variables

4.2.2.1 Bar chart

We use histograms and bar charts to display the numeric data and see how they are distributed.

#Continuous Variables


Userdataset %>% select(productsPassRate) %>% 
keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As the figure shows, the distribution is extremely unbalanced, because most users have a value of 0 for this variable.

#Discrete variables

Userdataset %>% select(2, 3, 4, 5, 6, 8, 9, 14) %>% 
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_bar()

As the graphs show, the values of each variable are extremely scattered and differ too much in magnitude, so we cannot visualize them well.

4.2.2.2 Box chart

Userdataset %>% select(2, 3, 4, 5, 6, 7, 8, 9, 14) %>% 
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_boxplot()

Since the bar charts are not informative, we also drew box plots to see the spread of the data. However, because of the imbalance in the data, the results are still not satisfactory.
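One common option for zero-heavy, long-tailed counts, which is not used in this report, is to plot them on a log scale. The sketch below only illustrates the transformation; the values are hypothetical but mimic the range reported for socialProductsLiked in the summary.

```r
# Sketch: log1p compresses a zero-heavy, long-tailed count variable.
# The values are hypothetical but mimic the range seen in the summary.
likes <- c(0, 0, 0, 0, 4, 27, 370, 51671)
log_likes <- log1p(likes)   # log(1 + x), so zeros stay at zero
round(range(log_likes), 2)
```

The transformed values could then be passed to the same histogram or box plot calls, with the axis labeled as log(1 + count).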

4.2.3 Graphs for non-numeric variables

4.2.3.1 Graph for country variables

For the country variable, we created a new data frame named “country”. Based on the country column, it contains a new column that counts how often each country occurs in Userdataset, that is, the number of users in that country. We then display these data with different graphs.

# Bar chart

country <- Userdataset %>% 
  group_by(country) %>%
  summarise(user_count = n())

 p1 <- ggplot(data = country,aes(x = user_count, y = country))+
  geom_col()
p1

The figure shows that users are distributed very unevenly across countries. Because there are too many countries to read on a bar chart, we decided to display the data on a heat map.

# Heatmap

library(ggplot2)
library(maps)

world_map <- map_data("world")

country <- aggregate(user_count ~ country, data = country, FUN = sum)

# Note: `country` holds French names (e.g. "Etats-Unis") while the map regions
# are English (e.g. "USA"), so only names spelled identically in both are matched.
country_map <- merge(world_map, country, by.x = "region", by.y = "country", all.x = TRUE)
# merge() reorders rows; restore the polygon drawing order
country_map <- country_map[order(country_map$group, country_map$order), ]

country_heatmap <- ggplot(country_map, aes(x = long, y = lat, group = group, fill = user_count)) +
  geom_polygon(color = "gray", linewidth = 0.2) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "User Count Heatmap", x = "", y = "") +
  theme_void()
# Display the heatmap
print(country_heatmap)

The map shows that the users of this C2C website are widely scattered, mainly across North America, Europe, West Asia and South Africa. Most user accounts come from Europe.

4.2.3.2 Graph for language variables

For languages, we display this variable using a bar chart.

a <- unlist(Userdataset$langEn) 
langEn <- sum(a=="1")
b <- unlist(Userdataset$langFr) 
langFr <- sum(b=="1")
c <- unlist(Userdataset$langDe) 
langDe <- sum(c=="1")
d <- unlist(Userdataset$langEs) 
langEs <- sum(d=="1")
e <- unlist(Userdataset$langIt)
langIt <- sum(e=="1")

language1 <- c("langEn","langFr","langDe","langEs","langIt")
number <- c(langEn,langFr,langDe,langEs,langIt)
LANGUAGE_DATA <- data.frame(language1,number)
p2 <- ggplot(LANGUAGE_DATA,aes(x=reorder(language1,number),y=number,fill=language1))+geom_col()+
  geom_text(aes(label = number), vjust = 1.5, colour = "white", position = position_dodge(.9), size = 5)
p2

The graph shows that English is the most common preferred language (52.2% of users), while Spanish is the least common (6.1%).
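The per-language counts can also be computed in one step. The sketch below is a minimal illustration on a hypothetical four-row excerpt (`toy`) whose indicator columns are factors, as in the report after Section 4.1.2.

```r
# Sketch: counting the "1" entries of several indicator columns at once.
# `toy` is a hypothetical excerpt; columns are factors with levels "0"/"1".
toy <- data.frame(
  langEn = factor(c(1, 1, 0, 1)),
  langFr = factor(c(0, 0, 1, 0)),
  langEs = factor(c(0, 0, 0, 0))
)

counts <- sapply(toy, function(col) sum(col == "1"))
sort(counts, decreasing = TRUE)
```

The resulting named vector can be turned into a data frame for plotting, which keeps the labels and the counts in sync.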

4.2.3.3 Graph for gender variables

We also use a bar chart to display the “gender” variable. Female users clearly far outnumber male users.

f <- unlist(Userdataset$genderFamale) 
genderFamale <- sum(f=="1")
g <- unlist(Userdataset$genderMale) 
genderMale <- sum(g=="1")

gender1 <- c("genderFamale", "genderMale")
number1 <- c(genderFamale,genderMale)
GENDER_DATA <- data.frame(gender1,number1)
p3 <- ggplot(GENDER_DATA,aes(x=reorder(gender1,number1),y=number1,fill=gender1))+geom_col()+
  geom_text(aes(label = number1), vjust = 1.5, colour = "white", position = position_dodge(.9), size = 5)
p3

4.2.3.4 Graph for other variables

We also use bar charts to show the other categorical variables.

Userdataset %>% select(civilityGenderId, hasAnyApp, hasProfilePicture) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_bar()
Warning: attributes are not identical across measure variables;
they will be dropped

We can see that the majority of users have the title “Mrs”. Only a minority of users have used an official app of this C2C e-commerce platform. In addition, the vast majority of users have a profile picture.

4.2.3.5 Box plot

For the non-numeric variables, in addition to the analysis above, we also use box plots to look at the distribution of the data.

Userdataset %>% select(10, 11, 12, 15,16,17,18,19, 20, 21) %>% 
  keep(is.factor) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_boxplot()
Warning: attributes are not identical across measure variables;
they will be dropped

4.3 Examination of the relationship between variables

To observe the relationships between the variables, we made “Multivariate correlation scatter matrix plots” and a “Correlation plot” for the numeric data, and a “Mosaic chart” for the non-numeric data.

4.3.1 Multivariate correlation scatter matrix plots - Numeric

library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Attaching package: 'GGally'
The following object is masked from 'package:ggmosaic':

    happy
Userdataset %>% select(2, 3, 4, 5, 6, 8, 9, 13, 14) %>% ggpairs()

4.3.2 Correlation plot - Numeric

UserdatasetP <- Userdataset %>% select(-22, -13)
plot_correlation(UserdatasetP, type= 'c', cor_args = list( 'use' = 'complete.obs'))

These two graphs show the correlations between the numeric variables. “seniority” has almost no correlation with the other variables, whereas “productsSold”, “socialProductsLiked”, “socialNbFollows” and “socialNbFollowers” are more strongly correlated with each other.
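The numbers behind such a plot can be inspected directly with a correlation matrix. The sketch below uses simulated, hypothetical data in which one pair of variables is built to be correlated and a third is independent; it is not computed from the project data.

```r
# Sketch: a correlation matrix on simulated, hypothetical data.
set.seed(7)
n <- 200
followers <- rpois(n, 3)
sold      <- followers + rpois(n, 1)   # built to correlate with followers
seniority <- runif(n, 2852, 3205)      # generated independently of the others
toy <- data.frame(followers, sold, seniority)

cm <- cor(toy)   # Pearson correlation by default
round(cm, 2)
```

Reading the matrix alongside the plot makes it easier to report which pairs are strongly related and which are close to zero.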

4.3.3 Mosaic Chart - Non-numeric

We use mosaic plots to show the relationship between each non-numeric variable and latency.

cat_table <- Userdataset %>% select(10, 11, 12, 15, 16, 17, 18, 19, 20, 21, 22)
# Columns 1-10 are the categorical predictors; column 11 is `latent`
for (i in 1:10) {
  var_name <- colnames(cat_table)[i]
  mosaicplot(table(cat_table[[i]], cat_table[["latent"]]),
             main = paste("Mosaic plot of latent by", var_name),
             color = TRUE,
             xlab = var_name,
             ylab = "latent")
}

5 Next step

About the data:

  1. Because the original dataset is large, we ran out of memory during data processing, so we randomly sampled the data for analysis. To keep the structure of the sample consistent with the original dataset, we have decided to use grouped sampling, with country as the grouping criterion, to select the data.

  2. Both the original data and our sample contain a very large number of latent users. To obtain more accurate results, we will consider segmenting users more finely in the next step: for example, users whose daysSinceLastLogin lies in [90, 365] would be latent users, while those who have not logged in for more than 365 days would be regarded as lost users. We assume that lost users will not log in to this website again, so there is no need to target them.
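The finer segmentation described in point 2 could be sketched with `cut()`. The values below are hypothetical, and the label “active” for users below 90 days is our own placeholder, since the text does not name that group.

```r
# Sketch: three user segments from daysSinceLastLogin (hypothetical values).
days <- c(11, 89, 90, 200, 365, 366, 709)

segment <- cut(days,
               breaks = c(-Inf, 89, 365, Inf),          # (-Inf,89], (89,365], (365,Inf)
               labels = c("active", "latent", "lost"))  # "active" is a placeholder label
data.frame(days, segment)
```

With whole-day values, these breaks place 90 to 365 days in the latent segment and anything above 365 days in the lost segment.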

Questions:

  1. When processing the data, we replaced all text data with numbers because we will use several different models later. Is this necessary?

  2. We kept the “country” column as “character” because of the large number of countries. If the models we fit later do not support “character” data, how should we handle the “country” column?

  3. In the section on numeric data, we find the bar charts and box plots unsatisfactory. Do you have any suggestions for visualization? For example, which other charts could display the numeric data better?

  4. According to the EDA, the correlations between many variables are not very strong, and we are afraid this will make the project difficult to complete. Do you have any suggestions for the next steps?